Exploring MLB Performance and Prediction Utilizing Machine Learning!
PLEASE NOTE: "Code blocks in this HTML report are hidden by default to keep the graphs and explanations readable. To see the code used in a section, press the Show button on its right."
This project examines the Hitters dataset, which contains salary and performance data for Major League Baseball players from the 1986 and 1987 seasons. Our main goal is to build a multinomial logistic regression model that forecasts salary level from player performance measures.
Because we are predicting a salary level rather than a continuous salary, this is a classification problem; and because there are more than two classes, it requires a multinomial logistic regression model rather than a simple (binary) logistic regression model.
A multinomial logistic regression model handles multi-class outcomes by fitting K - 1 log-odds equations, where K is the number of classes.
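Concretely, choosing one class (say class K) as the reference, the model specifies the log-odds of each remaining class relative to it:

$$
\log \frac{P(Y = k \mid X)}{P(Y = K \mid X)} = \beta_{k0} + \beta_k^{\top} X, \qquad k = 1, \dots, K-1
$$

Each of the K - 1 equations has its own intercept and coefficient vector, and the reference-class probability is recovered from the constraint that all class probabilities sum to 1.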
Here, we loaded essential R libraries for data manipulation and visualization.
kableExtra extends knitr::kable() tables with styling and HTML features.
patchwork combines ggplot2 plots into one layout (side by side or stacked).
ggfortify provides autoplot() for plots built with ggplot2 or from scratch.
GGally extends ggplot2 for exploring variable relationships.
utils provides head(), file operations, and data inspection tools.
base provides arithmetic and summary functions (mean(), sum(), etc.).
library(readr)
library(dplyr)
library(ggplot2)
library(tidyverse)
library(knitr)
library(kableExtra)
library(nnet)
library(caret)
library(glmnet)
library(patchwork)
library(tibble)
library(plotly)
library(corrplot)
library(factoextra)
library(cluster)
library(ggfortify)
library(GGally)
library(fmsb)
library(broom)
library(tidyr)
library(pROC)
library(xgboost)

We loaded the Hitters dataset into R.
| AtBat | Hits | HmRun | Runs | RBI | Walks | Years | CAtBat | CHits | CHmRun | CRuns | CRBI | CWalks | League | Division | PutOuts | Assists | Errors | Salary | NewLeague |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 293 | 66 | 1 | 30 | 29 | 14 | 1 | 293 | 66 | 1 | 30 | 29 | 14 | A | E | 446 | 33 | 20 | NA | A |
| 315 | 81 | 7 | 24 | 38 | 39 | 14 | 3449 | 835 | 69 | 321 | 414 | 375 | N | W | 632 | 43 | 10 | 475.0 | N |
| 479 | 130 | 18 | 66 | 72 | 76 | 3 | 1624 | 457 | 63 | 224 | 266 | 263 | A | W | 880 | 82 | 14 | 480.0 | A |
| 496 | 141 | 20 | 65 | 78 | 37 | 11 | 5628 | 1575 | 225 | 828 | 838 | 354 | N | E | 200 | 11 | 3 | 500.0 | N |
| 321 | 87 | 10 | 39 | 42 | 30 | 2 | 396 | 101 | 12 | 48 | 46 | 33 | N | E | 805 | 40 | 4 | 91.5 | N |
| 594 | 169 | 4 | 74 | 51 | 35 | 11 | 4408 | 1133 | 19 | 501 | 336 | 194 | A | W | 282 | 421 | 25 | 750.0 | A |
Now, we calculate the summary statistics of the dataset to identify any missing values, duplicate rows, or other issues. We do this so we can assess the quality of the data and make sure it’s ready for further analysis and modeling.
n_rows <- nrow(Hitters)
n_cols <- ncol(Hitters)
na_total <- sum(is.na(Hitters))
na_cols <- sum(colSums(is.na(Hitters)) > 0)
dup_index <- anyDuplicated(Hitters)
dataset_summary <- tibble(
Metric = c("Number of Rows",
"Number of Columns",
"Variables with Missing Values",
"Total Missing Values",
"Duplicate Rows Detected"),
Value = c(n_rows,
n_cols,
na_cols,
na_total,
ifelse(dup_index == 0, "No", paste("Yes (row", dup_index, ")")))
)
dataset_summary %>%
kable("html", caption = "Dataset Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "left")

| Metric | Value |
|---|---|
| Number of Rows | 317 |
| Number of Columns | 20 |
| Variables with Missing Values | 1 |
| Total Missing Values | 58 |
| Duplicate Rows Detected | No |
We checked the number of rows and columns to understand the dataset's structure, and the missing-value counts let us identify issues and decide how to handle them. The Salary variable, which is the target variable we aim to model, has 58 missing values.
Since we cannot build or evaluate a model with missing outcome values, we remove those rows.
Hitters <- na.omit(Hitters)
n_rows_after <- nrow(Hitters)
na_total_after <- sum(is.na(Hitters))
na_cols_after <- sum(colSums(is.na(Hitters)) > 0)
dup_index_after <- anyDuplicated(Hitters)
dataset_summary_after <- tibble(
Metric = c("Number of Rows",
"Number of Columns",
"Variables with Missing Values",
"Total Missing Values",
"Duplicate Rows Detected"),
Value = c(n_rows_after,
n_cols,
na_cols_after,
na_total_after,
ifelse(dup_index_after == 0, "No", paste("Yes (row", dup_index_after, ")")))
)
dataset_summary_after %>%
kable("html", caption = "Dataset After Removing Missing Values") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "left")

| Metric | Value |
|---|---|
| Number of Rows | 259 |
| Number of Columns | 20 |
| Variables with Missing Values | 0 |
| Total Missing Values | 0 |
| Duplicate Rows Detected | No |
After dropping the rows where we are missing the target variable, we have 259 complete observations.
Now we calculate and display the frequency distribution of players across different categories:
League
Division
NewLeague
We do this so we can better understand the distribution of players within each categorical variable, helping us analyze any patterns or imbalances in the dataset.
league_tbl <- Hitters %>%
count(League, name = "Number of Players") %>%
rename(`Category Value` = League) %>%
mutate(`Categorical Variable` = "League")
division_tbl <- Hitters %>%
count(Division, name = "Number of Players") %>%
rename(`Category Value` = Division) %>%
mutate(`Categorical Variable` = "Division")
newleague_tbl <- Hitters %>%
count(NewLeague, name = "Number of Players") %>%
rename(`Category Value` = NewLeague) %>%
mutate(`Categorical Variable` = "New League")
combined_freq <- bind_rows(league_tbl, division_tbl, newleague_tbl) %>%
select(`Categorical Variable`, `Category Value`, `Number of Players`)
combined_freq %>%
kable("html", caption = "Number of Players in Each Category of League, Division, and New League") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE,
position = "center"
)

| Categorical Variable | Category Value | Number of Players |
|---|---|---|
| League | A | 137 |
| League | N | 122 |
| Division | E | 128 |
| Division | W | 131 |
| New League | A | 139 |
| New League | N | 120 |
For League (A vs N), there is a slight imbalance, with A being more frequent. Division (E vs W) is fairly balanced, so both divisions are represented roughly equally. New League shows a pattern similar to League.
Now we should convert the League, Division, and NewLeague variables to factors to ensure they are properly encoded for categorical analysis.
We will then generate a summary of the dataset’s structure, including variable names, data types, and sample values, to better understand the content and types of data in each column. This will help us verify that the data is correctly formatted for analysis and modeling.
Hitters$League <- as.factor(Hitters$League)
Hitters$Division <- as.factor(Hitters$Division)
Hitters$NewLeague <- as.factor(Hitters$NewLeague)
structure_summary <- tibble(
`Variable Name` = names(Hitters),
`Data Type` = sapply(Hitters, function(x) class(x)[1]),
`Sample Values` = sapply(Hitters, function(x) {
if (is.factor(x)) {
paste0(as.numeric(head(x, 5)), collapse = ", ")
} else {
paste0(head(x, 5), collapse = ", ")
}
})
)
structure_summary %>%
kable("html", caption = "Detailed Structure of the Hitters Dataset After Encoding") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE,
position = "center"
)

| Variable Name | Data Type | Sample Values |
|---|---|---|
| AtBat | numeric | 315, 479, 496, 321, 594 |
| Hits | numeric | 81, 130, 141, 87, 169 |
| HmRun | numeric | 7, 18, 20, 10, 4 |
| Runs | numeric | 24, 66, 65, 39, 74 |
| RBI | numeric | 38, 72, 78, 42, 51 |
| Walks | numeric | 39, 76, 37, 30, 35 |
| Years | numeric | 14, 3, 11, 2, 11 |
| CAtBat | numeric | 3449, 1624, 5628, 396, 4408 |
| CHits | numeric | 835, 457, 1575, 101, 1133 |
| CHmRun | numeric | 69, 63, 225, 12, 19 |
| CRuns | numeric | 321, 224, 828, 48, 501 |
| CRBI | numeric | 414, 266, 838, 46, 336 |
| CWalks | numeric | 375, 263, 354, 33, 194 |
| League | factor | 2, 1, 2, 2, 1 |
| Division | factor | 2, 2, 1, 1, 2 |
| PutOuts | numeric | 632, 880, 200, 805, 282 |
| Assists | numeric | 43, 82, 11, 40, 421 |
| Errors | numeric | 10, 14, 3, 4, 25 |
| Salary | numeric | 475, 480, 500, 91.5, 750 |
| NewLeague | factor | 2, 1, 2, 2, 1 |
Since we are working with multinomial logistic regression, it’s important to scale the numeric predictors. This is because optimization algorithms in these models perform better when the numeric variables are on a similar scale.
In this step, we standardized the numeric columns of the Hitters dataset by setting their mean to 0 and their standard deviation to 1. This ensures that each variable contributes equally to the model.
Hitters_scaled <- Hitters
Hitters_scaled[, sapply(Hitters_scaled, is.numeric)] <-
scale(Hitters_scaled[, sapply(Hitters_scaled, is.numeric)])
scaled_vars <- names(Hitters_scaled)[sapply(Hitters_scaled, is.numeric)]
scaled_summary <- tibble(
`Variable Name` = scaled_vars,
`Data Type` = sapply(Hitters_scaled[scaled_vars], function(x) class(x)[1]),
`Scaled Sample Values` = sapply(Hitters_scaled[scaled_vars], function(x) paste0(round(head(x, 5), 2), collapse = ", "))
)
scaled_summary %>%
kable("html", caption = "Scaled Numeric Variables After Standardization") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE,
position = "center"
)

| Variable Name | Data Type | Scaled Sample Values |
|---|---|---|
| AtBat | numeric | -0.6, 0.51, 0.63, -0.56, 1.29 |
| Hits | numeric | -0.6, 0.49, 0.73, -0.46, 1.35 |
| HmRun | numeric | -0.52, 0.73, 0.96, -0.18, -0.86 |
| Runs | numeric | -1.2, 0.44, 0.4, -0.62, 0.76 |
| RBI | numeric | -0.52, 0.8, 1.03, -0.36, -0.01 |
| Walks | numeric | -0.08, 1.65, -0.18, -0.5, -0.27 |
| Years | numeric | 1.4, -0.89, 0.78, -1.1, 0.78 |
| CAtBat | numeric | 0.35, -0.45, 1.3, -0.98, 0.77 |
| CHits | numeric | 0.18, -0.4, 1.32, -0.95, 0.64 |
| CHmRun | numeric | 0.01, -0.06, 1.92, -0.69, -0.6 |
| CRuns | numeric | -0.12, -0.41, 1.42, -0.94, 0.43 |
| CRBI | numeric | 0.27, -0.19, 1.58, -0.87, 0.03 |
| CWalks | numeric | 0.45, 0.02, 0.37, -0.86, -0.24 |
| PutOuts | numeric | 1.21, 2.09, -0.33, 1.82, -0.04 |
| Assists | numeric | -0.53, -0.26, -0.75, -0.55, 2.06 |
| Errors | numeric | 0.2, 0.81, -0.86, -0.71, 2.47 |
| Salary | numeric | -0.13, -0.12, -0.07, -0.98, 0.48 |
The result is the dataset where the numeric columns have been standardized, and categorical columns (League, Division, NewLeague) remain unchanged as factor variables.
This visualization will help us understand the distribution of salaries in the dataset. By displaying both the histogram and density curve, we can assess the shape of the salary distribution. This is useful for identifying outliers, skewness, or any other patterns in the data that might affect model performance.
hist(Hitters_scaled$Salary,
main = "Histogram of Salary",
xlab = "Salary (standardized)",
col = "#f59ff2",
border = "white",
breaks = 20,
probability = TRUE)
lines(density(Hitters_scaled$Salary, na.rm = TRUE),
col = "#c44d08",
lwd = 2)
abline(v = quantile(Hitters_scaled$Salary, probs = c(0.25, 0.5, 0.75), na.rm = TRUE),
col = "black",
lty = 2)
legend("topright",
legend = c("Density", "Quartiles"),
col = c("#c44d08", "black"),
lty = c(1, 2),
bty = "n")
As we can observe, the salary distribution is highly right-skewed: a large portion of the players earn relatively low salaries, and only a small number have very high salaries. This skewness causes the quartiles to be unevenly spaced, with the first two quite close together and a longer stretch between the top ones.
Despite the skew seen above, dividing the salaries into quartiles still allows for a fair classification based on player rank and earnings. We will now split the target variable into four categories:
BenchWarmer (lowest)
Starter
Pro
AllStar (highest)
After splitting the target variable into the new classes, we can now visualize them separately and see their distribution.
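The binning step itself is hidden above; a minimal sketch of how the quartile split can be produced (an assumption about the hidden chunk, using the unscaled Salary column in Hitters) is:

```r
# Sketch (assumption: the report's hidden chunk does something equivalent).
# Cut the unscaled Salary into four quartile-based classes.
salary_breaks <- quantile(Hitters$Salary,
                          probs = c(0, 0.25, 0.5, 0.75, 1),
                          na.rm = TRUE)
Hitters_scaled$SalaryLevel <- cut(Hitters$Salary,
                                  breaks = salary_breaks,
                                  labels = c("BenchWarmer", "Starter", "Pro", "AllStar"),
                                  include.lowest = TRUE)
```

Using the quartile breakpoints guarantees roughly equal class sizes, which is why the bar chart below comes out balanced.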
Hitters_scaled$SalaryLevel <- factor(Hitters_scaled$SalaryLevel,
levels = c("BenchWarmer", "Starter", "Pro", "AllStar"))
ggplot(Hitters_scaled, aes(x = SalaryLevel, fill = SalaryLevel)) +
geom_bar(color = "black", width = 0.7) +
scale_fill_manual(values = c(
"BenchWarmer" = "#f7a1f0",
"Starter" = "#f772ec",
"Pro" = "#a30596",
"AllStar" = "#470142"
)) +
labs(
title = "Distribution of Salary Levels - Quartiles",
x = "Salary Category",
y = "Number of Players",
fill = "Salary Level"
) +
theme_minimal(base_size = 14) +
theme(
plot.title = element_text(size = 16, face = "bold", hjust = 0.5),
axis.text = element_text(size = 12),
legend.position = "none"
)

Observations & Assumptions:
This chart shows the distribution of players across the newly created categories. The counts on the Y-axis represent the number of players in each category.
We can see that the distribution is fairly balanced, with each category containing a similar number of players. This balance is important for building a classification model, as it helps avoid bias toward any one class and ensures the model can learn equally from each salary group.
These plots will show us how many players belong to each category.
pie_league <- Hitters_scaled %>%
count(League) %>%
ggplot(aes(x = "", y = n, fill = League)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
labs(title = "League Distribution", fill = "League") +
theme_void(base_size = 13) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
scale_fill_manual(values = c("#fbb1bd", "#ffd59a"))
pie_division <- Hitters_scaled %>%
count(Division) %>%
ggplot(aes(x = "", y = n, fill = Division)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
labs(title = "Division Distribution", fill = "Division") +
theme_void(base_size = 13) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
scale_fill_manual(values = c("#fff6a9", "#a8e6cf"))
pie_newleague <- Hitters_scaled %>%
count(NewLeague) %>%
ggplot(aes(x = "", y = n, fill = NewLeague)) +
geom_col(width = 1, color = "white") +
coord_polar(theta = "y") +
labs(title = "New League Distribution", fill = "New League") +
theme_void(base_size = 13) +
theme(
plot.title = element_text(size = 14, face = "bold", hjust = 0),
legend.title = element_text(size = 12),
legend.text = element_text(size = 10)
) +
scale_fill_manual(values = c("#aedff7", "#cdb4db"))
pie_league + pie_division + pie_newleague

Observations & Assumptions:
These pie charts show that player distributions across League, Division, and New League are quite balanced.
These stacked bar plots of Salary Level within League, Division, and NewLeague allow us to visualize the distribution of salary categories across each of these categorical variables. This helps to identify any patterns or associations between the salary levels and these factors, and to check for potential imbalances or biases within the groups.
ggplot(Hitters_scaled, aes(x = League, fill = SalaryLevel)) +
geom_bar(position = "fill") +
labs(
title = "SalaryLevel Distribution by League",
x = "League",
y = "Proportion",
fill = "Salary Level"
) +
scale_fill_brewer(palette = "Set2") +
theme_minimal()

ggplot(Hitters_scaled, aes(x = Division, fill = SalaryLevel)) +
geom_bar(position = "fill") +
labs(
title = "SalaryLevel Distribution by Division",
x = "Division",
y = "Proportion",
fill = "Salary Level"
) +
scale_fill_brewer(palette = "Set3") +
theme_minimal()

ggplot(Hitters_scaled, aes(x = NewLeague, fill = SalaryLevel)) +
geom_bar(position = "fill") +
labs(
title = "SalaryLevel Distribution by NewLeague",
x = "NewLeague",
y = "Proportion",
fill = "Salary Level"
) +
scale_fill_brewer(palette = "Pastel1") +
theme_minimal()

Observations & Assumptions:
League: both leagues have almost equally balanced proportions of the salary levels.
Division: here we can notice a bit of a difference in salary distribution between the two divisions (E & W). Division W has more Pro and BenchWarmer players, while Division E has more AllStar players. This may indicate that Division E includes stronger or more experienced players.
NewLeague: we obtain similar results as in the original league, since both NewLeagues are relatively balanced.
This test is used to determine if there is a significant association between two categorical variables. By comparing the observed frequencies with the expected frequencies, we can assess whether the distribution of one variable is independent of the other.
So we check if factors like League, Division, and SalaryLevel are related.
test_league <- chisq.test(Hitters_scaled$League, Hitters_scaled$SalaryLevel)
test_division <- chisq.test(Hitters_scaled$Division, Hitters_scaled$SalaryLevel)
test_newleague <- chisq.test(Hitters_scaled$NewLeague, Hitters_scaled$SalaryLevel)
chisq_results <- tibble::tibble(
Comparison = c("League vs SalaryLevel", "Division vs SalaryLevel", "NewLeague vs SalaryLevel"),
`X-squared` = c(round(test_league$statistic, 3),
round(test_division$statistic, 3),
round(test_newleague$statistic, 3)),
df = c(test_league$parameter,
test_division$parameter,
test_newleague$parameter),
`p-value` = c(round(test_league$p.value, 4),
round(test_division$p.value, 4),
round(test_newleague$p.value, 4)),
`Significant?` = ifelse(c(test_league$p.value,
test_division$p.value,
test_newleague$p.value) < 0.05,
"Yes", "No")
)
chisq_results %>%
kable("html", caption = "Chi-Square Test Results for Categorical Variables",
col.names = c("Comparison", "X² Statistic", "Degrees of Freedom", "p-value", "Significant?"),
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
column_spec(5, bold = TRUE, color = "white",
background = ifelse(chisq_results$`Significant?` == "Yes", "#28a745", "#dc3545"))

| Comparison | X² Statistic | Degrees of Freedom | p-value | Significant? |
|---|---|---|---|---|
| League vs SalaryLevel | 3.215 | 3 | 0.3596 | No |
| Division vs SalaryLevel | 7.356 | 3 | 0.0614 | No |
| NewLeague vs SalaryLevel | 2.803 | 3 | 0.4229 | No |
Observations & Assumptions:
Analyzing the p-values and chi-square statistics: none of the three tests is significant at the 0.05 level. There is no significant association between League and SalaryLevel (p = 0.3596) or between NewLeague and SalaryLevel (p = 0.4229), while Division vs SalaryLevel (p = 0.0614) is borderline but still above the threshold.
For the numeric variables, we likewise want to understand and visualize the distribution, spread, and central tendency of the continuous data, which lets us identify outliers, skewness, and potential data issues.
Analyzing these aspects ensures the numeric variables are properly understood and shows whether further transformations or data cleaning are necessary.
numeric_vars <- names(Hitters_scaled)[sapply(Hitters_scaled, is.numeric)]
Hitters_scaled_long <- Hitters_scaled %>%
select(SalaryLevel, all_of(numeric_vars)) %>%
pivot_longer(-SalaryLevel, names_to = "Variable", values_to = "Value")
p <- ggplot(Hitters_scaled_long, aes(x = SalaryLevel, y = Value, fill = SalaryLevel)) +
geom_boxplot() +
facet_wrap(~ Variable, scales = "free", ncol = 3) +
scale_fill_manual(values = c(
"BenchWarmer" = "#f57a9f",
"Starter" = "#87fabb",
"Pro" = "#f087fa",
"AllStar" = "#59e9ff"
)) +
labs(
title = "Numeric Variables by Salary Level",
x = "Salary Level",
y = "Value"
) +
theme_minimal(base_size = 12) +
theme(
legend.position = "right",
strip.text = element_text(size = 13),
plot.title = element_text(size = 16, face = "bold"),
axis.text.x = element_text(angle = 0, hjust = 1)
)
ggplotly(p, width = 1200, height = 1000)

From these boxplots we can see that CHmRun seems like a potentially important predictor of salary level. Players with more home runs tend to earn more.
A summary of all the numeric columns in the dataset with basic descriptive statistics such as the minimum, maximum, mean, median, and quartiles can help us to assess the distribution and spread of the numeric variables.
library(psych)
describe(select(Hitters_scaled, where(is.numeric))) %>%
kable("html", caption = "Descriptive Summary of Numeric Features") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center")

| | vars | n | mean | sd | median | trimmed | mad | min | max | range | skew | kurtosis | se |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AtBat | 1 | 259 | 0 | 1 | 0.0623693 | 0.0146159 | 1.2599634 | -2.6163064 | 1.925205 | 4.541511 | -0.1307147 | -0.9924099 | 0.062137 |
| Hits | 2 | 259 | 0 | 1 | -0.1091085 | -0.0299968 | 1.1501121 | -2.3698364 | 2.883031 | 5.252868 | 0.2518997 | -0.5725599 | 0.062137 |
| HmRun | 3 | 259 | 0 | 1 | -0.2926629 | -0.1018084 | 1.0109203 | -1.3154475 | 3.230262 | 4.545709 | 0.7977121 | -0.2240806 | 0.062137 |
| Runs | 4 | 259 | 0 | 1 | -0.1065648 | -0.0425988 | 1.1625051 | -2.1452220 | 2.951421 | 5.096643 | 0.3611305 | -0.5957662 | 0.062137 |
| RBI | 5 | 259 | 0 | 1 | -0.1666393 | -0.0694059 | 1.0329962 | -1.9859215 | 2.697763 | 4.683684 | 0.5734851 | -0.4074762 | 0.062137 |
| Walks | 6 | 259 | 0 | 1 | -0.1763757 | -0.0620567 | 1.1091439 | -1.9063738 | 3.003080 | 4.909454 | 0.5402856 | -0.3897742 | 0.062137 |
| Years | 7 | 259 | 0 | 1 | -0.2639175 | -0.0964012 | 0.9269137 | -1.3059087 | 3.487251 | 4.793159 | 0.8219781 | -0.0618169 | 0.062137 |
| CAtBat | 8 | 259 | 0 | 1 | -0.3132466 | -0.1387948 | 0.8221824 | -1.1468249 | 4.981220 | 6.128045 | 1.3235898 | 1.9884306 | 0.062137 |
| CHits | 9 | 259 | 0 | 1 | -0.3222000 | -0.1463383 | 0.8121040 | -1.1007530 | 5.441554 | 6.542307 | 1.4613915 | 2.9264773 | 0.062137 |
| CHmRun | 10 | 259 | 0 | 1 | -0.3583728 | -0.2033534 | 0.5439957 | -0.8353689 | 5.867038 | 6.702407 | 2.2267155 | 6.1516436 | 0.062137 |
| CRuns | 11 | 259 | 0 | 1 | -0.3326925 | -0.1495450 | 0.7563775 | -1.0783250 | 5.451243 | 6.529568 | 1.5342043 | 3.1227428 | 0.062137 |
| CRBI | 12 | 259 | 0 | 1 | -0.3118477 | -0.1786752 | 0.7030723 | -1.0061244 | 4.126564 | 5.132689 | 1.5476853 | 1.9809695 | 0.062137 |
| CWalks | 13 | 259 | 0 | 1 | -0.3163597 | -0.1761411 | 0.6645081 | -0.9790896 | 5.016126 | 5.995215 | 1.8975455 | 4.3284392 | 0.062137 |
| PutOuts | 14 | 259 | 0 | 1 | -0.2423863 | -0.1984242 | 0.5583746 | -1.0382593 | 3.854228 | 4.892488 | 2.0417089 | 3.9988960 | 0.062137 |
| Assists | 15 | 259 | 0 | 1 | -0.5115561 | -0.1654610 | 0.4380144 | -0.8276046 | 2.552740 | 3.380345 | 1.1404806 | -0.0044866 | 0.062137 |
| Errors | 16 | 259 | 0 | 1 | -0.2527814 | -0.1059203 | 0.8966874 | -1.3111943 | 3.527264 | 4.838459 | 0.9311802 | 0.2738200 | 0.062137 |
Observations & Assumptions:
The summary statistics show that, after standardization, the numerical features are centered around a mean of 0 with a standard deviation of 1. Several variables, such as HmRun (home runs) and CAtBat (career at-bats), show high maximum values, suggesting some players have substantially higher performance metrics. The spread of variables like Walks, Hits, and Runs also varies, with some players having extreme values, which may indicate outliers in the dataset.
We can run an Analysis of Variance for each numeric variable testing whether there are significant differences in the means of these variables across the salary levels. For each variable, it will create a model with SalaryLevel as the factor and the numeric variable as the response, then the results will show if the salary level has a significant effect on the numeric variables.
numeric_vars <- c("AtBat", "Hits", "HmRun", "Runs", "RBI",
"Walks", "Years", "PutOuts", "Assists", "Errors")
anova_results <- lapply(numeric_vars, function(var) {
model <- aov(as.formula(paste(var, "~ SalaryLevel")), data = Hitters_scaled)
tidy(model) %>%
filter(term == "SalaryLevel") %>%
mutate(Variable = var) %>%
select(Variable, df, statistic, p.value)
}) %>%
bind_rows() %>%
mutate(
statistic = round(statistic, 3),
p.display = ifelse(p.value < 0.0001, "< 0.0001", round(p.value, 6)),
`Significant?` = ifelse(p.value < 0.05, "Yes", "No")
) %>%
select(Variable, df, statistic, p.display, `Significant?`)
anova_results %>%
kable("html", caption = "ANOVA Test Results: SalaryLevel vs Numeric Variables",
col.names = c("Variable", "DF", "F-Statistic", "p-value", "Significant?"),
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = TRUE, position = "center") %>%
column_spec(5, bold = TRUE, color = "white",
background = ifelse(anova_results$`Significant?` == "Yes", "#28a745", "#dc3545"))

| Variable | DF | F-Statistic | p-value | Significant? |
|---|---|---|---|---|
| AtBat | 3 | 17.223 | < 0.0001 | Yes |
| Hits | 3 | 20.103 | < 0.0001 | Yes |
| HmRun | 3 | 11.035 | < 0.0001 | Yes |
| Runs | 3 | 16.511 | < 0.0001 | Yes |
| RBI | 3 | 18.804 | < 0.0001 | Yes |
| Walks | 3 | 17.625 | < 0.0001 | Yes |
| Years | 3 | 42.055 | < 0.0001 | Yes |
| PutOuts | 3 | 8.261 | < 0.0001 | Yes |
| Assists | 3 | 0.821 | 0.483453 | No |
| Errors | 3 | 1.972 | 0.118687 | No |
Observations & Assumptions:
The ANOVA results indicate whether the numerical predictors differ significantly across the salary categories (BenchWarmer, Starter, Pro, AllStar). We are looking for variables with large F-values and small p-values, as these should be included as strong predictors in the logistic regression modeling step. From the ANOVA test we can conclude that AtBat, Hits, HmRun, Runs, RBI, Walks, Years, and PutOuts are significant, while Assists and Errors are not.
numeric_data <- Hitters_scaled %>%
select(SalaryLevel, where(is.numeric)) %>%
pivot_longer(cols = -SalaryLevel, names_to = "Variable", values_to = "Value")
ggplot(numeric_data, aes(x = Value, fill = SalaryLevel)) +
geom_density(alpha = 0.5, color = "white", linewidth = 0.4) +
facet_wrap(~ Variable, scales = "free_y", ncol = 3) + # free_y lets y-axis adapt to each variable
scale_fill_manual(values = c(
"BenchWarmer" = "#f5427b",
"Starter" = "#ca68ed",
"Pro" = "#15ed27",
"AllStar" = "#15c6ed"
)) +
theme_minimal(base_size = 18) +
theme(
plot.title = element_text(size = 22, face = "bold", hjust = 0.5),
strip.text = element_text(size = 16, face = "bold"),
axis.text = element_text(size = 13),
legend.title = element_text(size = 14),
legend.text = element_text(size = 13),
panel.spacing = unit(2, "lines"),
legend.position = "bottom",
aspect.ratio = 1
) +
labs(
title = "Density Plots of Player Statistics by Salary Level",
x = "Standardized Value",
y = "Density",
fill = "Salary Level"
)

Observations & Assumptions:
We illustrated the distribution of standardized player performance statistics across four salary levels (BenchWarmer, Starter, Pro, AllStar).
Hits, RBI, Runs, and Years are seen as key predictors due to their clear distinction among salary categories.
A correlation matrix shows the pairwise correlations between the numeric variables. From it we can identify relationships and potential multicollinearity issues among the features, which can guide feature selection for modeling.
cor_matrix <- cor(select(Hitters_scaled, where(is.numeric)))
pink_palette <- colorRampPalette(c("#fff5fa", "#fad4fa", "#faa2fa", "#802f80", "#380538"))(200)
corrplot(cor_matrix,
method = "color",
type = "upper",
col = pink_palette,
tl.col = "black",
tl.srt = 45)

Observations & Assumptions:
The correlation matrix reveals how certain variables are strongly related to each other. This is helpful for identifying which variables to include in the model, as using highly correlated features may introduce redundancy and reduce model performance due to multicollinearity.
Here are the key observations:
Hits & AtBat: show a very strong positive correlation (darkest shade), meaning players who have more at-bats tend to accumulate more hits.
CRuns & CHits: also strongly positively correlated, reflecting cumulative offensive performance.
Assists & Errors: strongly positively correlated as well; players with more assists also tend to commit more errors, since both scale with the number of fielding chances.
RBI & CHits / CRuns / Hits: RBI (Runs Batted In) has a very strong positive correlation with these offensive stats, which suggests they capture similar player performance patterns; including all of them may be redundant.
Walks & CWalks: highly correlated, as expected, since players who walk more in a season also tend to have a high career walk count.
CHmRun & CHits: another example of very strong correlation, representing cumulative offensive strength.
Based on the Exploratory Data Analysis we have performed, including visualizations, ANOVA, Chi-square tests, and correlation analysis, we can classify features clearly:
Highly significant features are: Years (which has shown the strongest effect, F = 42.06), Hits, AtBat, Runs, RBI, Walks, HmRun, and PutOuts (weaker, but still significant).
Less significant features are: Assists (ANOVA p-value = 0.4835, not significant), Errors (p-value = 0.1187, not significant), League (Chi-square p-value = 0.3596, not significant), Division (Chi-square p-value = 0.0614, borderline), NewLeague (Chi-square p-value = 0.4229, not significant).
However, at this first stage we decide to keep all the features.
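Keeping all features for the first pass can be expressed as a model formula; a small sketch (assuming SalaryLevel and Salary exist in Hitters_scaled as defined above):

```r
# Sketch: build the full-feature formula for the upcoming classification step,
# dropping the raw Salary (it would leak the target) and the target itself.
predictors <- setdiff(names(Hitters_scaled), c("Salary", "SalaryLevel"))
model_formula <- reformulate(predictors, response = "SalaryLevel")
model_formula
```

A formula built this way can later be passed directly to nnet::multinom().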
In this step, we perform a clustering analysis on the player statistics using the K-means algorithm. The purpose of this is to group players with similar performance profiles into clusters, which can help identify distinct player types based on their stats.
We use the Elbow Method to determine the optimal number of clusters for K-means. The “elbow” point, where the decrease in WSS begins to slow down, indicates the optimal number of clusters to choose. This helps us decide the most appropriate clustering configuration for the data.
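The WSS being plotted is the total within-cluster sum of squares, where $\mu_j$ denotes the centroid of cluster $C_j$:

$$
\mathrm{WSS}(k) = \sum_{j=1}^{k} \; \sum_{x_i \in C_j} \lVert x_i - \mu_j \rVert^2
$$

WSS always decreases as $k$ grows, so we look for the point where the marginal improvement levels off rather than the minimum itself.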
numeric_data <- select(Hitters_scaled, where(is.numeric))
fviz_nbclust(numeric_data, kmeans, method = "wss") +
theme_minimal() +
labs(title = "Elbow Method - Optimal Number of Clusters")

Observations & Assumptions:
The elbow seems to occur around k = 3 or 4, which is the point of maximum curvature. So we will choose k = 4 clusters — this provides a good balance between:
Capturing real structure in the data
Avoiding unnecessary complexity
We applied K-means clustering on player statistics to explore whether players naturally group based on performance — without using salary labels. The elbow method indicated that 4 clusters best fit the data.
set.seed(123)
kmeans_result <- kmeans(numeric_data, centers = 4, nstart = 25)
Hitters_scaled$Cluster <- as.factor(kmeans_result$cluster)
fviz_cluster(kmeans_result,
data = numeric_data,
geom = "point",
ellipse.type = "norm",
palette = c("#f0078b", "#ffd59a", "#ad29ff", "#006df2"),
ggtheme = theme_minimal(),
main = "K-means Clustering of Players - 4 Clusters")

Cross-tabulation of salary level by assigned cluster:

##               1  2  3  4
## BenchWarmer  44 13  0  9
## Starter      40  8  5 11
## Pro          19 21 14 16
## AllStar       5 11 23 20
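The contingency table above compares the salary labels with the cluster assignments; a sketch of the call that would produce it (an assumption, since the chunk is hidden):

```r
# Sketch (assumption: this reproduces the printed contingency table above).
# Rows are salary levels, columns are the K-means cluster assignments.
table(Hitters_scaled$SalaryLevel, Hitters_scaled$Cluster)
```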
Observations & Assumptions:
The PCA-based visualization shows that some clusters are clearly defined. Judging from the cross-tabulation, Cluster 1 captures mostly low-salary players (BenchWarmers and Starters), while Cluster 3 contains almost exclusively Pro and AllStar players. Clusters 2 and 4 are more mixed, blending Pros and AllStars with some lower-paid players, potentially indicating high variability in performance traits.
When comparing clusters to SalaryLevel, we observe some alignment:
BenchWarmers are highly concentrated in Cluster 1
AllStars fall mostly into Clusters 3 and 4
Starters mostly fall into Cluster 1, but with some overlap
This suggests that salary levels align partially with statistical profiles, but other factors may play a role (player role, age, recent performance).
pca_model <- prcomp(select(Hitters_scaled, where(is.numeric)), scale. = TRUE)
autoplot(pca_model,
data = Hitters_scaled,
colour = 'SalaryLevel',
size = 2.2,
alpha = 0.6,
shape = FALSE,
frame = TRUE,
frame.type = 'norm',
loadings = FALSE,
loadings.label = FALSE,
label = FALSE
) +
scale_color_manual(values = c(
"BenchWarmer" = "#f0078b",
"Starter" = "#048226",
"Pro" = "#006df2",
"AllStar" = "#ad29ff"
)) +
scale_x_continuous(
limits = c(-0.25, 0.3),
breaks = seq(-0.25, 0.3, by = 0.1)
) +
scale_y_continuous(
limits = c(-0.2, 0.2),
breaks = seq(-0.2, 0.2, by = 0.1)
) +
theme_minimal(base_size = 16) +
theme(
plot.title = element_text(size = 20, face = "bold", hjust = 0.5),
legend.position = "right",
legend.title = element_text(size = 14, face = "bold"),
legend.text = element_text(size = 12),
axis.text = element_text(size = 13),
panel.grid.major = element_line(color = "gray90"),
legend.box = "vertical"
) +
labs(
title = "PCA: Player Stats Colored by Salary Level",
x = "Principal Component 1",
y = "Principal Component 2",
color = "Salary Level"
)
Observations & Assumptions:
PC1 (X-axis) captures the largest share of variance and is heavily influenced by career-long cumulative stats such as: CWalks, CHits, CRuns, CHmRun, and Years
PC2 (Y-axis) captures smaller but still meaningful variance, associated with performance impact and salary indicators, including: RBI, Hits, and Salary
These components help us understand the general structure of the data using only two axes, instead of many separate stats.
The PCA plot effectively separates BenchWarmers from AllStars, showing that SalaryLevel reflects actual performance patterns. There is some overlap between Starters and Pros, which might indicate a more subtle difference in their profiles. Variables like CHits, CWalks, and Years are the most influential in defining performance clusters.
BenchWarmers are mostly separated from other classes, suggesting distinct lower performance profiles.
AllStars are slightly shifted to the right, suggesting stronger statistics along PC1, which is probably related to their achievements in their careers.
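To see exactly how much variance PC1 and PC2 capture, the proportion-of-variance breakdown can be read straight out of the `prcomp()` object. A minimal sketch, using the built-in `mtcars` data purely for illustration (not the Hitters variables):

```r
# Minimal sketch: reading the proportion of variance explained by each
# principal component. mtcars stands in for the scaled Hitters data.
pca <- prcomp(mtcars, scale. = TRUE)

# Squared standard deviations are the variances captured per component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Share of total variance captured by the first two components
round(var_explained[1:2], 3)
round(sum(var_explained[1:2]), 3)
```

The same idea applied to `pca_model` above would quantify how faithful the two-axis PCA plot is to the full feature space.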
Hitters_scaled %>%
group_by(SalaryLevel) %>%
summarise(across(where(is.numeric), mean)) %>%
pivot_longer(-SalaryLevel) %>%
ggplot(aes(x = SalaryLevel, y = value, fill = SalaryLevel)) +
geom_col(position = "dodge") +
facet_wrap(~ name, scales = "free") +
theme_minimal(base_size = 14)
Observations & Assumptions:
The bar charts display the mean standardized values of player performance metrics by salary levels (BenchWarmer, Starter, Pro, AllStar).
This clearly illustrates that higher salary levels such as "AllStar" are consistently associated with stronger performance metrics such as Hits, RBI, Runs, and Years.
The visualization below presents a scatterplot matrix of key player performance metrics, colored by SalaryLevel. This matrix allows us to explore pairwise relationships, distributions, and correlation strength across player groups.
Observations & Assumptions:
The strong positive correlations suggest that players who score more runs also tend to earn more RBIs and record more hits, which aligns with baseball performance logic.
Here, we wanted to assess the average performance of players across the four Salary Levels and compare them. The radar plot condenses several metrics into a single view, making the relationships easier to grasp.
radar_vars <- c("Hits", "HmRun", "Runs", "RBI", "Walks", "PutOuts", "Errors", "Years")
radar_data <- Hitters_scaled %>%
group_by(SalaryLevel) %>%
summarise(across(all_of(radar_vars), mean))
radar_scaled <- as.data.frame(scale(radar_data[,-1]))
rownames(radar_scaled) <- radar_data$SalaryLevel
radar_scaled <- rbind(
rep(1, length(radar_vars)),
rep(0, length(radar_vars)),
radar_scaled
)
rownames(radar_scaled)[1:2] <- c("Max", "Min")
colors_border <- c("#f5427b", "#f4c86d", "#f1ed27", "#15c6ed")
radarchart(radar_scaled,
pcol = colors_border,
plwd = 2,
plty = 1,
cglcol = "grey",
cglty = 1,
axislabcol = "black",
caxislabels = seq(0, 1, 0.2),
vlcex = 0.8,
title = "Radar Plot - Average Player Profile by Salary Level")
legend(x = "topright", legend = rownames(radar_scaled)[3:6],
bty = "n", pch = 20, col = colors_border, text.col = "black", cex = 0.9)Observations & Assumptions:
This radar plot clearly illustrates the statistical hierarchy among SalaryLevel groups:
There is a clear progression from BenchWarmer to Starter to Pro to AllStar.
The broad spread of the AllStar profile confirms their elite skill level in nearly every area.
Radar plots like this support feature selection by showing which variables may be most helpful in distinguishing between classes.
The high error rate in BenchWarmers is particularly notable and may serve as a red flag or predictive feature in modeling.
To prepare for model training and evaluation, we split our scaled dataset into training and test sets, using a stratified 70/30 split to preserve the class distribution.
We then fit a multinomial logistic regression model on the training data using our numeric performance features and categorical league/division features. This lets us estimate the probability that each player belongs to one of the four Salary Levels we defined earlier.
Hitters_test1 <- Hitters_scaled
Hitters_test1$SalaryLevel <- factor(Hitters_test1$SalaryLevel,
levels = c("BenchWarmer", "Starter", "Pro", "AllStar"))
set.seed(123)
train_index <- createDataPartition(Hitters_test1$SalaryLevel, p = 0.7, list = FALSE)
train_data <- Hitters_test1[train_index, ]
test_data <- Hitters_test1[-train_index, ]
# Fit multinomial logistic regression
full_model <- multinom(SalaryLevel ~ AtBat + Hits + HmRun + Runs + RBI +
Walks + Years + PutOuts + Assists + Errors +
League + Division + NewLeague,
data = train_data, trace = FALSE)
# Tidy up coefficients and style the table
tidy_model <- tidy(full_model) %>%
mutate(
estimate = round(estimate, 3),
std.error = round(std.error, 3),
statistic = round(statistic, 3),
p.value = ifelse(p.value < 0.0001, "< 0.0001", round(p.value, 4)),
Significance = case_when(
p.value == "< 0.0001" ~ "***",
as.numeric(p.value) < 0.01 ~ "**",
as.numeric(p.value) < 0.05 ~ "*",
TRUE ~ ""
)
)
We created a table of the estimated coefficients from the multinomial logistic regression model to show how each predictor influences the probability of a player falling into a given Salary Level. For each term we report the:
estimate
standard error
z-statistic
p-value
significance stars
The more significant predictors have 1 - 3 stars and indicate their contribution to distinguishing the salary groups.
# 1. Coefficients Summary
tidy_model %>%
select(y.level, term, estimate, std.error, statistic, p.value, Significance) %>%
kable("html", caption = "Multinomial Logistic Regression Summary",
col.names = c("Target Level", "Term", "Estimate", "Std. Error", "Z-Stat", "p-value", "Sig."),
align = "c") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(height = "350px")| Target Level | Term | Estimate | Std. Error | Z-Stat | p-value | Sig. |
|---|---|---|---|---|---|---|
| Starter | (Intercept) | 3.329 | 1.008 | 3.301 | 0.001 | ** |
| Starter | AtBat | -3.495 | 1.684 | -2.076 | 0.0379 | * |
| Starter | Hits | 1.268 | 1.717 | 0.738 | 0.4602 | |
| Starter | HmRun | 0.481 | 0.882 | 0.545 | 0.5858 | |
| Starter | Runs | 1.591 | 1.199 | 1.327 | 0.1845 | |
| Starter | RBI | 1.085 | 1.032 | 1.052 | 0.2927 | |
| Starter | Walks | 0.435 | 0.581 | 0.748 | 0.4548 | |
| Starter | Years | 5.294 | 1.206 | 4.390 | < 0.0001 | *** |
| Starter | PutOuts | -0.004 | 0.582 | -0.007 | 0.9948 | |
| Starter | Assists | 1.209 | 0.620 | 1.948 | 0.0514 | |
| Starter | Errors | -1.173 | 0.536 | -2.187 | 0.0288 | * |
| Starter | LeagueN | 3.773 | 1.739 | 2.170 | 0.03 | * |
| Starter | DivisionW | 0.313 | 0.660 | 0.474 | 0.6353 | |
| Starter | NewLeagueN | -3.224 | 1.608 | -2.005 | 0.045 | * |
| Pro | (Intercept) | 3.241 | 1.022 | 3.172 | 0.0015 | ** |
| Pro | AtBat | -4.252 | 1.764 | -2.410 | 0.0159 | * |
| Pro | Hits | 2.461 | 1.766 | 1.394 | 0.1634 | |
| Pro | HmRun | 0.797 | 0.918 | 0.869 | 0.3849 | |
| Pro | Runs | 1.194 | 1.241 | 0.963 | 0.3358 | |
| Pro | RBI | 0.684 | 1.073 | 0.637 | 0.5242 | |
| Pro | Walks | 0.950 | 0.591 | 1.610 | 0.1075 | |
| Pro | Years | 5.934 | 1.221 | 4.861 | < 0.0001 | *** |
| Pro | PutOuts | 0.959 | 0.560 | 1.713 | 0.0867 | |
| Pro | Assists | 1.221 | 0.642 | 1.902 | 0.0571 | |
| Pro | Errors | -0.858 | 0.551 | -1.556 | 0.1196 | |
| Pro | LeagueN | 2.563 | 2.000 | 1.281 | 0.2 | |
| Pro | DivisionW | 0.446 | 0.704 | 0.634 | 0.5258 | |
| Pro | NewLeagueN | -1.407 | 1.873 | -0.751 | 0.4527 | |
| AllStar | (Intercept) | 2.950 | 1.056 | 2.793 | 0.0052 | ** |
| AllStar | AtBat | -3.395 | 1.866 | -1.819 | 0.0689 | |
| AllStar | Hits | 2.350 | 1.838 | 1.279 | 0.201 | |
| AllStar | HmRun | 0.840 | 0.953 | 0.882 | 0.378 | |
| AllStar | Runs | 1.015 | 1.301 | 0.780 | 0.4351 | |
| AllStar | RBI | 0.864 | 1.118 | 0.773 | 0.4396 | |
| AllStar | Walks | 0.969 | 0.610 | 1.589 | 0.112 | |
| AllStar | Years | 6.291 | 1.234 | 5.099 | < 0.0001 | *** |
| AllStar | PutOuts | 1.414 | 0.581 | 2.435 | 0.0149 | * |
| AllStar | Assists | 0.866 | 0.672 | 1.289 | 0.1976 | |
| AllStar | Errors | -0.558 | 0.590 | -0.947 | 0.3438 | |
| AllStar | LeagueN | 1.747 | 2.145 | 0.814 | 0.4156 | |
| AllStar | DivisionW | -0.282 | 0.765 | -0.368 | 0.7126 | |
| AllStar | NewLeagueN | -0.787 | 2.028 | -0.388 | 0.698 |
We visualized a traditional confusion matrix to compare the predicted classifications versus the actual salary levels in the test data set that we set aside.
Observations & Assumptions:
The model performs best on the BenchWarmer class, with 15 of the 19 actual BenchWarmers predicted correctly. Starter and Pro show higher misclassification rates, with many players confused with an adjacent class.
AllStar had a decent performance, with 8 of 17 correctly identified, though several were classified into lower tiers than deserved.
# 2. Predictions & Evaluation
predictions <- predict(full_model, newdata = test_data)
conf_matrix <- caret::confusionMatrix(as.factor(predictions), test_data$SalaryLevel)
# 3. Confusion Matrix (Traditional Layout)
table(Predicted = predictions, Actual = test_data$SalaryLevel) %>%
kable("html", caption = "Confusion Matrix: Multinomial Logistic Regression") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "300px")| BenchWarmer | Starter | Pro | AllStar | |
|---|---|---|---|---|
| BenchWarmer | 15 | 6 | 0 | 0 |
| Starter | 2 | 7 | 9 | 4 |
| Pro | 0 | 3 | 8 | 5 |
| AllStar | 2 | 3 | 4 | 8 |
We provide sensitivity (recall), specificity, precision, F1 score, and other metrics per class, for each of our four Salary Levels.
Observations & Assumptions:
BenchWarmer has the highest sensitivity at 0.789 and an F1 score of 0.75, indicating the model identifies this class best.
Starter performs poorly on almost every metric, suggesting the model struggles to distinguish it.
Pro and AllStar are moderate, with F1 scores of 0.43 and 0.47.
Overall, there is considerable overlap and confusion between categories, as salary is an expansive and confounded variable to predict.
The null accuracy of always guessing the most common class is 0.276, so our model performs substantially better than this baseline, and significantly so (p < 0.05).
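Both of these quantities are simple to reproduce by hand. A quick sketch, plugging in the BenchWarmer precision and recall from the per-class metrics, plus hypothetical class counts matching the test-set prevalences:

```r
# F1 is the harmonic mean of precision and recall.
# Values below are the BenchWarmer row of the per-class metrics.
precision <- 0.7142857
recall    <- 0.7894737
f1 <- 2 * precision * recall / (precision + recall)
round(f1, 2)  # the 0.75 reported for BenchWarmer

# Null accuracy: always predicting the most frequent class.
# Hypothetical counts mirroring the test-set prevalences (n = 76).
counts <- c(BenchWarmer = 19, Starter = 19, Pro = 21, AllStar = 17)
null_acc <- max(counts) / sum(counts)
round(null_acc, 3)  # ~0.276, the AccuracyNull reported below
```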
# 4. Per-Class Metrics (Sensitivity, Specificity, etc.)
as.data.frame(conf_matrix$byClass) %>%
rownames_to_column("Class") %>%
kable("html", caption = "Per-Class Performance Metrics (e.g. Sensitivity, Specificity)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "300px")| Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: BenchWarmer | 0.7894737 | 0.8947368 | 0.7142857 | 0.9272727 | 0.7142857 | 0.7894737 | 0.7500000 | 0.2500000 | 0.1973684 | 0.2763158 | 0.8421053 |
| Class: Starter | 0.3684211 | 0.7368421 | 0.3181818 | 0.7777778 | 0.3181818 | 0.3684211 | 0.3414634 | 0.2500000 | 0.0921053 | 0.2894737 | 0.5526316 |
| Class: Pro | 0.3809524 | 0.8545455 | 0.5000000 | 0.7833333 | 0.5000000 | 0.3809524 | 0.4324324 | 0.2763158 | 0.1052632 | 0.2105263 | 0.6177489 |
| Class: AllStar | 0.4705882 | 0.8474576 | 0.4705882 | 0.8474576 | 0.4705882 | 0.4705882 | 0.4705882 | 0.2236842 | 0.1052632 | 0.2236842 | 0.6590229 |
# 5. Overall Model Metrics (Accuracy, Kappa, etc.)
as.data.frame(t(conf_matrix$overall)) %>%
kable("html", caption = "Overall Model Performance Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "200px")| Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
|---|---|---|---|---|---|---|
| 0.5 | 0.333641 | 0.3830267 | 0.6169733 | 0.2763158 | 2.95e-05 | NaN |
To decide on a cross-validation method, we drew a random sample, seeding the generator with the birthday closest to August 21st.
The closest was Chloe's birthday, so we set the random seed to 18052005.
set.seed(18052005)
cv_methods <- c("1. Vanilla validation set",
"2. Leave-One-Out CV (LOOCV)",
"3. K-fold CV (K = 5)",
"4. K-fold CV (K = 10)")
chosen_cv <- sample(cv_methods, 1)
# Convert to table for clean HTML output
tibble(`Chosen Cross-Validation Method` = chosen_cv) %>%
kable("html", caption = "Randomly Selected Cross-Validation Method") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Chosen Cross-Validation Method |
|---|
| 1. Vanilla validation set |
Due to this selection process, we fit our model to this Cross-Validation method: Vanilla Validation set.
This method simply splits the data into a 70/30 train-test assignment, using a stratified partition to preserve class proportions, with Salary Level as the target variable.
We then tested a multinomial logistic regression model on this training data and made predictions on the testing set.
To further evaluate our metrics and performance we outputted a confusion matrix.
set.seed(123)
train_index <- createDataPartition(Hitters_scaled$SalaryLevel, p = 0.7, list = FALSE)
train_data <- Hitters_scaled[train_index, ]
test_data <- Hitters_scaled[-train_index, ]
full_model <- multinom(SalaryLevel ~ ., data = train_data, trace = FALSE)
preds <- predict(full_model, newdata = test_data)
conf_mat <- confusionMatrix(preds, test_data$SalaryLevel)
conf_matrix_df <- as.data.frame(conf_mat$table)
kable(conf_matrix_df, "html", caption = "Confusion Matrix: Multinomial Logistic Regression") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center"
) %>%
scroll_box(width = "100%", height = "300px")| Prediction | Reference | Freq |
|---|---|---|
| BenchWarmer | BenchWarmer | 13 |
| Starter | BenchWarmer | 4 |
| Pro | BenchWarmer | 0 |
| AllStar | BenchWarmer | 2 |
| BenchWarmer | Starter | 4 |
| Starter | Starter | 11 |
| Pro | Starter | 2 |
| AllStar | Starter | 2 |
| BenchWarmer | Pro | 0 |
| Starter | Pro | 10 |
| Pro | Pro | 6 |
| AllStar | Pro | 5 |
| BenchWarmer | AllStar | 2 |
| Starter | AllStar | 3 |
| Pro | AllStar | 4 |
| AllStar | AllStar | 8 |
byclass_df <- as.data.frame(conf_mat$byClass)
byclass_df <- tibble::rownames_to_column(byclass_df, "Class")
kable(byclass_df, "html", caption = "Per-Class Classification Metrics") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center"
) %>%
scroll_box(width = "100%", height = "300px")| Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: BenchWarmer | 0.6842105 | 0.8947368 | 0.6842105 | 0.8947368 | 0.6842105 | 0.6842105 | 0.6842105 | 0.2500000 | 0.1710526 | 0.2500000 | 0.7894737 |
| Class: Starter | 0.5789474 | 0.7017544 | 0.3928571 | 0.8333333 | 0.3928571 | 0.5789474 | 0.4680851 | 0.2500000 | 0.1447368 | 0.3684211 | 0.6403509 |
| Class: Pro | 0.2857143 | 0.8909091 | 0.5000000 | 0.7656250 | 0.5000000 | 0.2857143 | 0.3636364 | 0.2763158 | 0.0789474 | 0.1578947 | 0.5883117 |
| Class: AllStar | 0.4705882 | 0.8474576 | 0.4705882 | 0.8474576 | 0.4705882 | 0.4705882 | 0.4705882 | 0.2236842 | 0.1052632 | 0.2236842 | 0.6590229 |
overall_df <- as.data.frame(t(conf_mat$overall))
kable(overall_df, "html", caption = "Overall Model Performance Metrics") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center"
) %>%
scroll_box(width = "100%", height = "200px")| Accuracy | Kappa | AccuracyLower | AccuracyUpper | AccuracyNull | AccuracyPValue | McnemarPValue |
|---|---|---|---|---|---|---|
| 0.5 | 0.3348687 | 0.3830267 | 0.6169733 | 0.2763158 | 2.95e-05 | NaN |
Observations & Assumptions: The model is reasonably accurate on BenchWarmer and AllStar, but Starter and Pro are frequently confused with each other.
The model accurately classified 50% of the test set, with a Kappa of 0.33 and performs significantly better than simply guessing, as our p-value < 0.05.
The class-specific metrics display high sensitivity for BenchWarmers, but relatively poor performance for the Pros.
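Since Kappa appears alongside accuracy throughout, it is worth spelling out: Cohen's Kappa measures agreement beyond chance, kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed accuracy and p_e the accuracy expected from the marginal class frequencies alone. A small hand-worked sketch on a hypothetical 2x2 confusion matrix (not our actual results):

```r
# Hypothetical 2-class confusion matrix (rows = predicted, cols = actual)
cm <- matrix(c(40, 10,
                5, 45), nrow = 2, byrow = TRUE)
n <- sum(cm)

# Observed agreement: proportion on the diagonal
p_o <- sum(diag(cm)) / n

# Expected agreement from the row/column marginals
p_e <- sum(rowSums(cm) * colSums(cm)) / n^2

kappa <- (p_o - p_e) / (1 - p_e)
kappa  # 0.7 here; 0 = chance-level agreement, 1 = perfect agreement
```

This is the same statistic `caret::confusionMatrix()` reports in its overall metrics.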
To complement the randomly selected CV method, we chose K-Fold Cross-Validation (K=10) as our second evaluation method. We trained a multinomial logistic regression model using the caret::train() function with 10-fold CV.
We tested three values of the regularization parameter
decay:
0 (no regularization)
1e-4 (light regularization)
1e-1 (stronger regularization)
The best performing configuration used decay = 1e-1, yielding:
Accuracy: 59.1%
Kappa: 0.45
The light regularization setting (decay = 1e-4) gave the lowest standard deviation across folds, indicating stable, if slightly less accurate, performance.
These cross-validated estimates give a fairer picture of the model's generalization beyond a single train/test split.
cv_control <- trainControl(method = "cv", number = 10)
set.seed(123)
cv_model <- train(SalaryLevel ~ ., data = Hitters_scaled,
method = "multinom",
trControl = cv_control,
trace = FALSE)
cv_model$results %>%
kable("html", caption = "10-Fold Cross-Validation Results for Multinomial Logistic Regression") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "250px")| decay | Accuracy | Kappa | AccuracySD | KappaSD |
|---|---|---|---|---|
| 0e+00 | 0.5714188 | 0.4268895 | 0.0544096 | 0.0721937 |
| 1e-04 | 0.5752650 | 0.4317266 | 0.0465745 | 0.0627250 |
| 1e-01 | 0.5909573 | 0.4527177 | 0.0800982 | 0.1084634 |
Observations & Assumptions: Decay is used to regularize and higher values shrink the coefficients more as to reduce overfitting. The best performing model is with decay = 0.1, with an accuracy of 59 percent at a 0.45 Kappa.
This suggests that a moderate level of regularization improves the model's ability to generalize, and that performance is relatively consistent and stable across folds!
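For intuition on what the decay parameter does: in `nnet::multinom()` it is an L2 (weight-decay) penalty, so larger values shrink the coefficients toward zero. A minimal sketch on the built-in `iris` data (illustrative only, not the Hitters model):

```r
library(nnet)

# Sketch: weight decay in multinom() is an L2 penalty on the coefficients,
# so increasing it shrinks them toward zero. Illustrated on iris, whose
# separable classes produce very large unregularized coefficients.
set.seed(123)
m0 <- multinom(Species ~ ., data = iris, decay = 0,   trace = FALSE)
m1 <- multinom(Species ~ ., data = iris, decay = 0.1, trace = FALSE)

# Total coefficient magnitude drops as regularization increases
sum(abs(coef(m0)))
sum(abs(coef(m1)))
```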
We employed K-fold CV with K = 10 via cv.glmnet() for LASSO model refinement. To refine the multinomial logistic regression model, we applied LASSO regularization using the same 10-fold CV scheme as above.
This method penalizes less important coefficients, performing variable selection and streamlining the model!
x_train <- model.matrix(SalaryLevel ~ ., train_data)[, -1]
y_train <- train_data$SalaryLevel
set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, family = "multinomial", type.measure = "class", nfolds = 10)
plot(cv_lasso)
lambda_min <- cv_lasso$lambda.min
lambda_1se <- cv_lasso$lambda.1se
data.frame(
Lambda_Type = c("Best Lambda (min)", "1-SE Lambda"),
Value = c(round(lambda_min, 6), round(lambda_1se, 6))
) %>%
kable("html", caption = "Selected Lambda Values from LASSO Cross-Validation") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center")| Lambda_Type | Value |
|---|---|
| Best Lambda (min) | 0.021110 |
| 1-SE Lambda | 0.036891 |
The cross-validation plot shows the misclassification error across various values of log(λ).
Observations & Assumptions: Two key λ values were identified:
Lambda.min: 0.0211, the lowest classification error
Lambda.1se: 0.0369, the more conservative choice for a balance.
x_train <- model.matrix(SalaryLevel ~ ., train_data)[, -1]
y_train <- train_data$SalaryLevel
set.seed(123)
cv_lasso <- cv.glmnet(x_train, y_train, family = "multinomial", type.measure = "class", nfolds = 10)
lambda_min <- cv_lasso$lambda.min
lambda_1se <- cv_lasso$lambda.1se
lasso_min_model <- glmnet(x_train, y_train, family = "multinomial", lambda = lambda_min)
lasso_1se_model <- glmnet(x_train, y_train, family = "multinomial", lambda = lambda_1se)
x_test <- model.matrix(SalaryLevel ~ ., test_data)[, -1]
y_test <- test_data$SalaryLevel
pred_min <- predict(lasso_min_model, newx = x_test, type = "class")
pred_1se <- predict(lasso_1se_model, newx = x_test, type = "class")
pred_min <- factor(pred_min, levels = levels(y_test))
pred_1se <- factor(pred_1se, levels = levels(y_test))
y_test <- factor(y_test, levels = levels(y_test))
data.frame(
Variable = c("pred_min", "y_test"),
Levels = I(list(levels(pred_min), levels(y_test)))
) %>%
unnest_longer(Levels) %>%
kable("html", caption = "Levels of Predicted and Actual Classes") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center")| Variable | Levels |
|---|---|
| pred_min | BenchWarmer |
| pred_min | Starter |
| pred_min | Pro |
| pred_min | AllStar |
| y_test | BenchWarmer |
| y_test | Starter |
| y_test | Pro |
| y_test | AllStar |
conf_min <- confusionMatrix(pred_min, y_test)
conf_1se <- confusionMatrix(pred_1se, y_test)
accuracy_lasso_min <- conf_min$overall["Accuracy"]
kappa_lasso_min <- conf_min$overall["Kappa"]
accuracy_lasso_1se <- conf_1se$overall["Accuracy"]
kappa_lasso_1se <- conf_1se$overall["Kappa"]
lasso_results <- data.frame(
Model = c("LASSO (lambda.min)", "LASSO (lambda.1se)"),
Lambda = c(round(lambda_min, 6), round(lambda_1se, 6)),
Accuracy = c(round(conf_min$overall["Accuracy"], 3), round(conf_1se$overall["Accuracy"], 3)),
Kappa = c(round(conf_min$overall["Kappa"], 3), round(conf_1se$overall["Kappa"], 3))
)
kable(lasso_results, caption = "Performance Comparison of LASSO Models with Different Lambda Values") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "200px")| Model | Lambda | Accuracy | Kappa |
|---|---|---|---|
| LASSO (lambda.min) | 0.021110 | 0.513 | 0.351 |
| LASSO (lambda.1se) | 0.036891 | 0.487 | 0.316 |
Observations & Assumptions:
Lambda.min achieved better results, with an accuracy of 51.3% and a kappa of 0.351, compared to 48.7% accuracy and 0.316 kappa for lambda.1se.
These results demonstrate the trade-off between model complexity and generalizability. To prevent overfitting, the 1-SE rule suggests choosing the largest λ within 1 standard error of the minimum error (λ = lambda.1se).
This balances predictive performance with model simplicity.
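The 1-SE rule itself is easy to state in code: among all λ whose CV error is within one standard error of the minimum, pick the largest. A minimal sketch over a hypothetical error curve (the `lambda`, `cvm`, and `cvsd` vectors stand in for the fields cv.glmnet returns; the numbers are illustrative):

```r
# Hypothetical CV results, mimicking cv.glmnet's lambda / cvm / cvsd fields
lambda <- c(0.001, 0.005, 0.010, 0.021, 0.037, 0.080)
cvm    <- c(0.52,  0.50,  0.49,  0.48,  0.50,  0.55)  # mean CV error
cvsd   <- c(0.03,  0.03,  0.03,  0.03,  0.03,  0.03)  # its standard error

# lambda.min: the lambda with the smallest mean CV error
i_min      <- which.min(cvm)
lambda_min <- lambda[i_min]

# lambda.1se: the LARGEST lambda whose error is within one SE of the minimum
threshold  <- cvm[i_min] + cvsd[i_min]
lambda_1se <- max(lambda[cvm <= threshold])

c(lambda_min = lambda_min, lambda_1se = lambda_1se)
```

With these made-up numbers, lambda_min comes out as 0.021 and lambda_1se as 0.037, mirroring how the rule trades a little error for extra shrinkage.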
Next, we trained a Random Forest with 100 trees, 3 variables considered per split, and a maximum of 10 terminal nodes per tree.
library(randomForest)
set.seed(123)
train_index <- createDataPartition(Hitters_scaled$SalaryLevel, p = 0.7, list = FALSE)
train_data <- Hitters_scaled[train_index, ]
test_data <- Hitters_scaled[-train_index, ]
train_data$SalaryLevel <- as.factor(train_data$SalaryLevel)
test_data$SalaryLevel <- as.factor(test_data$SalaryLevel)
set.seed(123)
rf_model <- randomForest(SalaryLevel ~ .,
data = train_data,
ntree = 100,
mtry = 3,
maxnodes = 10)
rf_pred <- predict(rf_model, newdata = test_data)
conf_rf <- confusionMatrix(rf_pred, test_data$SalaryLevel)
rf_results <- data.frame(
Metric = c("Accuracy", "Kappa"),
Value = c(round(conf_rf$overall["Accuracy"], 3), round(conf_rf$overall["Kappa"], 3))
)
kable(rf_results, caption = "Random Forest Model Performance Metrics") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "200px")| Metric | Value | |
|---|---|---|
| Accuracy | Accuracy | 0.605 |
| Kappa | Kappa | 0.472 |
as.data.frame(conf_rf$table) %>%
kable("html", caption = "Confusion Matrix: Random Forest") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center"
) %>%
scroll_box(width = "100%", height = "300px")| Prediction | Reference | Freq |
|---|---|---|
| BenchWarmer | BenchWarmer | 16 |
| Starter | BenchWarmer | 2 |
| Pro | BenchWarmer | 1 |
| AllStar | BenchWarmer | 0 |
| BenchWarmer | Starter | 4 |
| Starter | Starter | 8 |
| Pro | Starter | 7 |
| AllStar | Starter | 0 |
| BenchWarmer | Pro | 0 |
| Starter | Pro | 3 |
| Pro | Pro | 11 |
| AllStar | Pro | 7 |
| BenchWarmer | AllStar | 0 |
| Starter | AllStar | 0 |
| Pro | AllStar | 6 |
| AllStar | AllStar | 11 |
Observations & Assumptions:
The Random Forest model achieved an accuracy of 0.605, meaning 60.5% of predictions on the test set were correct. Compared to the earlier models, this is a noticeable improvement, and it classifies the BenchWarmer and AllStar classes especially well.
We initially set XGBoost aside because an untuned run achieved suspiciously perfect accuracy, suggesting overfitting. After tuning the hyperparameters and converting the categorical factor columns to numeric (as XGBoost requires), we re-evaluated it for a more realistic performance estimate.
label_map <- levels(train_data$SalaryLevel)
y_train_xgb <- as.numeric(train_data$SalaryLevel) - 1
y_test_xgb <- as.numeric(test_data$SalaryLevel) - 1
x_train_xgb <- model.matrix(SalaryLevel ~ . , data = train_data)[, -1]
x_test_xgb <- model.matrix(SalaryLevel ~ . , data = test_data)[, -1]
dtrain <- xgb.DMatrix(data = x_train_xgb, label = y_train_xgb)
dtest <- xgb.DMatrix(data = x_test_xgb, label = y_test_xgb)
params <- list(
objective = "multi:softprob",
num_class = length(unique(y_train_xgb)),
eval_metric = "mlogloss",
eta = 0.1,
max_depth = 4,
subsample = 0.7,
colsample_bytree = 0.7
)
set.seed(123)
xgb_model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100,
watchlist = list(train = dtrain),
verbose = 0
)
xgb_preds <- predict(xgb_model, dtest)
xgb_pred_labels <- max.col(matrix(xgb_preds, nrow = length(y_test_xgb), byrow = TRUE)) - 1
pred_factor <- factor(label_map[xgb_pred_labels + 1], levels = label_map)
y_test_factor <- factor(label_map[y_test_xgb + 1], levels = label_map)
conf_xgb <- confusionMatrix(pred_factor, y_test_factor)
acc_xgb <- conf_xgb$overall["Accuracy"]
kappa_xgb <- conf_xgb$overall["Kappa"]
xgb_metrics <- data.frame(
Metric = c("Accuracy", "Kappa"),
Value = c(round(acc_xgb, 3), round(kappa_xgb, 3))
)
kable(xgb_metrics, caption = "Tuned XGBoost Performance Summary") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "200px")| Metric | Value | |
|---|---|---|
| Accuracy | Accuracy | 0.671 |
| Kappa | Kappa | 0.559 |
as.data.frame(conf_xgb$table) %>%
kable("html", caption = "Confusion Matrix: Tuned XGBoost") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "300px")| Prediction | Reference | Freq |
|---|---|---|
| BenchWarmer | BenchWarmer | 17 |
| Starter | BenchWarmer | 1 |
| Pro | BenchWarmer | 1 |
| AllStar | BenchWarmer | 0 |
| BenchWarmer | Starter | 5 |
| Starter | Starter | 10 |
| Pro | Starter | 4 |
| AllStar | Starter | 0 |
| BenchWarmer | Pro | 0 |
| Starter | Pro | 3 |
| Pro | Pro | 14 |
| AllStar | Pro | 4 |
| BenchWarmer | AllStar | 0 |
| Starter | AllStar | 0 |
| Pro | AllStar | 7 |
| AllStar | AllStar | 10 |
as.data.frame(conf_xgb$byClass) %>%
rownames_to_column("Class") %>%
kable("html", caption = "Per-Class Performance Metrics (Tuned XGBoost)") %>%
kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE, position = "center") %>%
scroll_box(width = "100%", height = "300px")| Class | Sensitivity | Specificity | Pos Pred Value | Neg Pred Value | Precision | Recall | F1 | Prevalence | Detection Rate | Detection Prevalence | Balanced Accuracy |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Class: BenchWarmer | 0.8947368 | 0.9122807 | 0.7727273 | 0.9629630 | 0.7727273 | 0.8947368 | 0.8292683 | 0.2500000 | 0.2236842 | 0.2894737 | 0.9035088 |
| Class: Starter | 0.5263158 | 0.9298246 | 0.7142857 | 0.8548387 | 0.7142857 | 0.5263158 | 0.6060606 | 0.2500000 | 0.1315789 | 0.1842105 | 0.7280702 |
| Class: Pro | 0.6666667 | 0.7818182 | 0.5384615 | 0.8600000 | 0.5384615 | 0.6666667 | 0.5957447 | 0.2763158 | 0.1842105 | 0.3421053 | 0.7242424 |
| Class: AllStar | 0.5882353 | 0.9322034 | 0.7142857 | 0.8870968 | 0.7142857 | 0.5882353 | 0.6451613 | 0.2236842 | 0.1315789 | 0.1842105 | 0.7602193 |
eval_results <- data.frame(
Model = c("Vanilla CV", "10-Fold CV", "LASSO (lambda.min)", "LASSO (lambda.1se)", "Random Forest", "XGBoost"),
Accuracy = c(
conf_mat$overall["Accuracy"],
cv_model$results$Accuracy[2],
accuracy_lasso_min,
accuracy_lasso_1se,
conf_rf$overall["Accuracy"], # Replaces rf_accuracy
round(acc_xgb, 3)
),
Kappa = c(
conf_mat$overall["Kappa"],
cv_model$results$Kappa[2],
kappa_lasso_min,
kappa_lasso_1se,
conf_rf$overall["Kappa"], # Replaces rf_kappa
round(kappa_xgb, 3)
)
)
# Display the table nicely
eval_results %>%
mutate(across(c(Accuracy, Kappa), ~ round(.x, 3))) %>%
kbl(caption = "Model Comparison: Accuracy and Kappa") %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
position = "center"
)
| Model | Accuracy | Kappa |
|---|---|---|
| Vanilla CV | 0.500 | 0.335 |
| 10-Fold CV | 0.575 | 0.432 |
| LASSO (lambda.min) | 0.513 | 0.351 |
| LASSO (lambda.1se) | 0.487 | 0.316 |
| Random Forest | 0.605 | 0.472 |
| XGBoost | 0.671 | 0.559 |
plot_data <- pivot_longer(eval_results, cols = c(Accuracy, Kappa),
names_to = "Metric", values_to = "Score")
ggplot(plot_data, aes(x = Model, y = Score, fill = Metric)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
labs(title = "Comparison of Models - Accuracy & Kappa",
x = "Model",
y = "Score") +
scale_fill_manual(values = c("orchid", "#bd063d")) +
theme_minimal() +
theme(axis.text.x = element_text(angle = 15, hjust = 1),
plot.title = element_text(face = "bold"))Observations & Assumptions:
XGBoost has the highest accuracy at 0.67 and the highest Kappa at 0.56.
LASSO with lambda.min performs better than lambda.1se because it fits the data more closely, though this could risk some overfitting.
LASSO with lambda.1se is the worst-performing model; it favors simplicity at the cost of predictive power here.
We wanted to visualize the 15 most important features selected by LASSO, under the two lambda values tuned above.
clean_feature_name <- function(f) {
gsub("^(.*?)([:\\._].*)?$", "\\1", f)
}
extract_nonzero_coefs <- function(coef_obj, lambda_type) {
purrr::map_df(coef_obj, ~ as.data.frame(as.matrix(.x)) %>%
rownames_to_column("Feature"),
.id = "Class") %>%
pivot_longer(cols = -c(Feature, Class), names_to = "Lambda", values_to = "Coefficient") %>%
filter(Coefficient != 0, Feature != "(Intercept)") %>%
mutate(
Feature = clean_feature_name(Feature),
Importance = abs(Coefficient),
Model = lambda_type
)
}
x <- model.matrix(SalaryLevel ~ ., data = train_data)[, -1]
y <- train_data$SalaryLevel
cv_model <- cv.glmnet(x, y, alpha = 1, family = "multinomial", type.measure = "class")
coef_min <- coef(cv_model, s = "lambda.min")
coef_1se <- coef(cv_model, s = "lambda.1se")
importance_min <- extract_nonzero_coefs(coef_min, "lambda.min")
importance_1se <- extract_nonzero_coefs(coef_1se, "lambda.1se")
importance_combined <- bind_rows(importance_min, importance_1se)
top_n_features <- importance_combined %>%
group_by(Feature) %>%
summarise(TotalImportance = sum(Importance)) %>%
arrange(desc(TotalImportance)) %>%
slice_head(n = 15) %>%
pull(Feature)
importance_filtered <- importance_combined %>%
filter(Feature %in% top_n_features) %>%
mutate(Feature = factor(Feature, levels = rev(top_n_features)))
ggplot(importance_filtered, aes(x = Feature, y = Importance, fill = Model)) +
geom_bar(stat = "identity", position = position_dodge(width = 0.8)) +
coord_flip() +
scale_fill_manual(values = c("lambda.min" = "#cba3e8", "lambda.1se" = "#e6ccf5")) +
labs(title = "Feature Importance: LASSO (lambda.min vs lambda.1se)",
x = "Feature", y = "Summed Absolute Coefficient",
fill = "Model") +
theme_minimal() +
theme(axis.text.y = element_text(size = 10),
plot.title = element_text(face = "bold", hjust = 0.5))

Observations & Assumptions:
Years is consistently the most influential feature under both lambdas. This plot illustrates the trade-off between interpretability (lambda.1se retains fewer features) and predictive power (lambda.min retains more).
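That sparsity trade-off can be quantified by counting the nonzero coefficients retained at each lambda. A self-contained sketch, using iris as a small stand-in for our Hitters training data:

```r
# Count nonzero (non-intercept) coefficients kept at each lambda;
# iris stands in here for our Hitters training data
library(glmnet)
set.seed(1)
x <- as.matrix(iris[, 1:4])
y <- iris$Species
cvm <- cv.glmnet(x, y, alpha = 1, family = "multinomial", type.measure = "class")
# coef() returns one sparse matrix per class; sum nonzeros across classes
nz <- function(s) sum(sapply(coef(cvm, s = s), function(m) sum(m[-1, ] != 0)))
c(lambda.min = nz("lambda.min"), lambda.1se = nz("lambda.1se"))
```

The more heavily penalized lambda.1se model keeps at most as many features as lambda.min, which is exactly the interpretability-versus-fit trade-off seen in the plot.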
tidy_model %>%
filter(term != "(Intercept)") %>%
group_by(term) %>%
summarise(Importance = sum(abs(estimate)), .groups = "drop") %>%
ggplot(aes(x = reorder(term, Importance), y = Importance)) +
geom_bar(stat = "identity", fill = "#e889d8") +
coord_flip() +
labs(title = "Feature Importance: Multinomial Logistic Regression - Aggregated",
x = "Feature",
y = "Summed Absolute Coefficient") +
theme_minimal()

Observations & Assumptions:
Years, AtBat, and LeagueN show the strongest influence on the model’s predictions, suggesting that experience and league are key differentiating factors in determining a player’s Salary Level.
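A note on the name "LeagueN": it is not a raw column in the data. When a factor enters the model, model.matrix() dummy-codes it and appends the non-reference level to the variable name. A tiny illustration with toy rows (not our training data):

```r
# model.matrix() dummy-codes factors: League (levels A, N) becomes the
# single column "LeagueN" (0 for A, 1 for N), with A as the reference
df <- data.frame(League = factor(c("A", "N", "N")),
                 AtBat  = c(300, 450, 500))
model.matrix(~ League + AtBat, data = df)
# columns: (Intercept), LeagueN, AtBat
```

So a large coefficient on LeagueN measures the shift associated with playing in league N relative to league A.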
importance_rf <- randomForest::importance(rf_model)
importance_df <- as.data.frame(importance_rf) %>%
tibble::rownames_to_column("Feature") %>%
filter(Feature != "Salary") %>% # <-- Exclude Salary
arrange(desc(MeanDecreaseGini))
ggplot(importance_df, aes(x = reorder(Feature, MeanDecreaseGini), y = MeanDecreaseGini)) +
geom_bar(stat = "identity", fill = "#898ce8") +
coord_flip() +
labs(title = "Feature Importance: Random Forest",
x = "Feature",
y = "Mean Decrease in Gini") +
theme_minimal()

Observations & Assumptions:
CRuns, CAtBat, and CHits are the most important, suggesting that career totals such as runs, at-bats, and hits heavily influence the decision-making of the model.
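MeanDecreaseGini is the impurity-based importance measure. Fitting with importance = TRUE also exposes the permutation-based MeanDecreaseAccuracy, which is often a useful cross-check. A self-contained sketch, with iris as a small stand-in dataset:

```r
# Fitting with importance = TRUE yields permutation-based
# MeanDecreaseAccuracy alongside the impurity-based MeanDecreaseGini
# (iris as a small stand-in for our data)
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
randomForest::importance(fit)[, c("MeanDecreaseAccuracy", "MeanDecreaseGini")]
```

If the two measures rank features very differently, the Gini-based ranking may be inflated for high-cardinality or correlated predictors.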
importance_matrix <- xgb.importance(model = xgb_model)
ggplot(importance_matrix[1:15, ], aes(x = reorder(Feature, Gain), y = Gain)) +
geom_bar(stat = "identity", fill = "#7ed9b6") +
coord_flip() +
labs(title = "XGBoost Feature Importance (Top 15 Features)",
x = "Feature",
y = "Importance") +
theme_minimal() +
theme(axis.text.y = element_text(size = 10),
plot.title = element_text(face = "bold"))

Observations & Assumptions:
Here we see the top 15 predictors selected by XGBoost, ranked by Gain (each feature’s contribution to the reduction in loss). CHits, CRuns, and CRBI are the most valuable features.
Similar to Random Forest, XGBoost favors these career statistics, but it additionally captures subtle interactions and benefits from built-in regularization.
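For reference, xgb.importance() reports Cover and Frequency alongside Gain, and the Gain values are normalized to shares that sum to 1. A minimal self-contained multiclass fit on iris (a stand-in for the Hitters features):

```r
# xgb.importance() columns: Feature, Gain, Cover, Frequency;
# Gain shares across features sum to 1 (iris as a stand-in dataset)
library(xgboost)
X <- as.matrix(iris[, 1:4])
y <- as.integer(iris$Species) - 1   # xgboost expects 0-based class labels
dtrain <- xgb.DMatrix(data = X, label = y)
bst <- xgb.train(params = list(objective = "multi:softprob", num_class = 3),
                 data = dtrain, nrounds = 10)
imp <- xgb.importance(model = bst)
imp
sum(imp$Gain)  # Gain shares sum to 1
```

This is why the bars in the plot above can be read as proportions of the model’s total loss reduction.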
This ROC curve compares the four classification models on the test set from the vanilla validation split:
Visualizing the trade-off between:
True Positive Rate: Sensitivity
False Positive Rate: 1 - Specificity
The Area Under the Curve (AUC) concisely summarizes each classification model’s ability to accurately delineate the four Salary Levels.
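Since pROC’s roc() handles a binary outcome, each curve below effectively scores one Salary Level against the rest (a one-vs-rest view). A toy sketch of that idea with made-up labels and scores (not our model output):

```r
# One-vs-rest ROC: recode a multi-level factor to a binary indicator
# for the class of interest, then score it (toy labels and scores)
library(pROC)
labels <- factor(c("Low", "Low", "High", "High", "Mid", "Mid"))
is_low <- as.integer(labels == "Low")   # 1 = the class of interest
scores <- c(0.9, 0.7, 0.2, 0.1, 0.4, 0.3)
r <- roc(is_low, scores, quiet = TRUE)
auc(r)  # AUC = 1 here: every "Low" score exceeds every other score
```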
# TRUE labels
true_labels <- test_data$SalaryLevel
# Get predicted probabilities
multinom_probs <- predict(full_model, newdata = test_data, type = "prob")
lasso_probs <- predict(lasso_min_model, newx = model.matrix(SalaryLevel ~ ., test_data)[, -1], type = "response")[,,1]
rf_probs <- predict(rf_model, newdata = test_data, type = "prob")
xgb_probs <- matrix(xgb_preds, nrow = length(y_test_xgb), byrow = TRUE)
roc_list_smooth <- list(
Multinomial = roc(true_labels, multinom_probs[,1], smooth=TRUE),
LASSO = roc(true_labels, lasso_probs[,1], smooth=TRUE),
RandomForest = roc(true_labels, rf_probs[,1], smooth=TRUE),
XGBoost = roc(true_labels, xgb_probs[,1], smooth=TRUE)
)
auc_values <- sapply(roc_list_smooth, auc)
legend_labels <- paste(names(roc_list_smooth), "(AUC =", round(auc_values, 3), ")")
names(roc_list_smooth) <- legend_labels
roc_plot_smooth <- ggroc(roc_list_smooth, legacy.axes=TRUE, aes="color", size=1.5) +
ggtitle("ROC Curves Comparison with AUC") +
xlab("False Positive Rate (1 - Specificity)") +
ylab("True Positive Rate (Sensitivity)") +
theme_minimal(base_size = 18) +
geom_abline(intercept=0, slope=1, linetype="dashed", color="gray", size=1) +
scale_color_manual(values=c("#2a8bf5", "#fa73da", "#2af599", "#c873fa")) +
theme(legend.title = element_blank(),
legend.position = "right")
print(roc_plot_smooth)

Observations & Assumptions:
XGBoost and Random Forest are the most skilled at distinguishing the Salary Levels.
XGBoost achieved the best overall performance with the highest AUC. However, it is the most complex of the models, so the remaining models are still worth noting.
Random Forest has a similar AUC with slightly lower accuracy, and is reasonably interpretable.
Multinomial Logistic Regression is a very reliable baseline, though its accuracy lagged the two ensemble methods. It is also simple and highly interpretable.
LASSO Regression has lower accuracy in comparison, but uses fewer variables through regularization, making it more efficient and interpretable.
We employed AI to check our doubts and thought processes when structuring and clarifying our extra steps (further analysis of our dataset). We also used AI to debug errors and misalignments, and to improve the readability of the R console outputs (through polished graphs and kable tables) that would otherwise not be interpretable or visually comprehensible.
Thank you! - Strong Entities